A dataset can have multiple problems, and before putting the data into a machine learning model or doing any statistical analysis, the data needs to be cleaned first. It can have issues such as the ones below.
You are already familiar with the Titanic dataset, but we have added a few more scenarios to it for the purpose of this tutorial.
1) Inappropriate column names
Column names should be consistent across the dataset. Sometimes there are spaces inside the column names, or at the start or end, and these are hard to spot with the naked eye. A good approach is to call ".columns" on the data frame; it outputs the column names with quotes, which makes stray spaces visible.
Code:
import pandas as pd

df = pd.read_csv("data_set.csv")
print(df.columns)
Output:
Index(['Passenger Id', 'Survived ', 'Pclass', 'Name', 'Gender', 'Age', 'Fare', 'Embarked'], dtype='object')
As you can clearly see, there is a space inside the 'Passenger Id' column name and a trailing space at the end of the 'Survived' column name.
How to Fix?
We can rename the columns to whatever we prefer.
df.rename(columns = {"Passenger Id": "Passenger_Id",
                     "Survived ": "Survived"},
          inplace = True)
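If many columns are affected, renaming each one by hand gets tedious. As a minimal sketch (assuming the only problems are stray spaces), you can clean every column name in one pass:
Code:
# Strip leading/trailing whitespace from all column names at once
df.columns = df.columns.str.strip()
# Optionally, replace any remaining internal spaces with underscores
df.columns = df.columns.str.replace(' ', '_')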
2) Inappropriate column data types
A user name column cannot have an integer or float data type; similarly, an age or salary column cannot have text as its data type.
Whenever we see such scenarios, the issue needs to be fixed.
Let us check the data types of the given data set:
Code:
df.dtypes
Output:
Passenger_Id int64
Survived object
Pclass int64
Name object
Gender object
Age float64
Fare int64
Embarked object
All the data types are correct except for the Survived column. Since it contains 0 (not survived) or 1 (survived), its data type must be int or float. Let us look at the data in this column.
Code:
df['Survived'].value_counts(dropna = False)
Output:
0 12
1 11
male 1
As we can see, besides 0 and 1, an erroneous value, 'male', is also present.
How to Fix?
- Either you can drop the offending record from the data set itself:
df = df[df['Survived'] != 'male']
- Or you can set the erroneous values to Null and impute the Null values later:
df['Survived'] = pd.to_numeric(df['Survived'], errors='coerce')
After the fix, the erroneous value has been replaced with NaN and the column type has changed from text to float.
The errors argument accepts three options; you can try each of them and observe their behaviour:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
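Here is a small sketch of all three modes on a toy Series (note that 'ignore' is deprecated in recent pandas versions):
Code:
import pandas as pd

s = pd.Series(['0', '1', 'male'])

# 'coerce': unparseable values become NaN and the dtype becomes float
print(pd.to_numeric(s, errors='coerce'))   # 0.0, 1.0, NaN

# 'ignore': the Series is returned unchanged if parsing fails
print(pd.to_numeric(s, errors='ignore'))   # '0', '1', 'male'

# 'raise' (the default): a ValueError is raised on 'male'
try:
    pd.to_numeric(s, errors='raise')
except ValueError as err:
    print('raise:', err)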
3) Missing values in Data
Columns can have some of their values missing.
We can drop those particular records from the data set itself, but this may lead to information loss. Alternatively, we can impute the missing data values.
Imputation (filling out the missing values) can be performed by:
mean_value = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_value)
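Mean imputation is only one option. As a sketch (reusing the columns from the dataset above), the median is more robust for skewed numeric columns, and the most frequent value (mode) works for categorical columns:
Code:
# Median is more robust than the mean when the column is skewed
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# For a categorical column, fill with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])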
4) Outliers in Data
Outliers are extreme values whose chance of occurrence is low. A data set can have outliers, and while performing any analysis it is necessary to handle them, because outliers can deeply affect your results.
Examples:
- An old person's age usually lies between 70 and 90
- An employee's salary usually ranges from 4 to 10 Lacs per annum
In all such cases, data values far outside these ranges are so extreme that their chance of occurrence is very low; the usual/general data points are quite different from them.
There are multiple ways to handle the outliers:
1) The easiest way is to drop the records where you find the outlier. But this is the worst treatment, as we lose information.
2) We can cap the outlier values at some threshold (see the sketch after this list).
Set all ages to 80 if they cross 80 years
Set all salaries to 15 if they cross 15 Lacs
3) We can use transformations. A transformation like log changes the scale of the data, so an outlier no longer remains an outlier.
4) We can assign outliers to some other category.
For example, we can set the value of all the rare data points to -1
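Here is a minimal sketch of options 2) and 3), assuming the 'Age' and 'Fare' columns from the dataset above; the thresholds are chosen purely for illustration:
Code:
import numpy as np

# Option 2: cap (clip) values at a threshold
df['Age'] = df['Age'].clip(upper=80)

# Option 3: log-transform to compress the scale
# (log1p handles zeros; assumes Fare is non-negative)
df['Fare_log'] = np.log1p(df['Fare'])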
5) Duplicate Data
When the data set contains duplicate records, it affects the results. Suppose one particular record is an outlier and multiple duplicates of it are present in the dataset; they will adversely affect your analysis. So it is recommended to drop duplicates before performing your analysis.
To drop duplicate records:
df.drop_duplicates(subset = ['customer_id'], inplace = True)
This will make sure that the data set contains only a single entry per customer.
df.drop_duplicates(inplace = True)
This will make sure that every row is distinct from the others.
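Before dropping anything, it can help to count how many duplicates exist and to control which copy survives. A small sketch (keep is a standard drop_duplicates parameter; 'customer_id' is the same example column as above):
Code:
# Count fully duplicated rows
print(df.duplicated().sum())

# Keep the last occurrence of each customer instead of the default first
df.drop_duplicates(subset = ['customer_id'], keep = 'last', inplace = True)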